Recently, there has been growing debate as to whether or not static analysis can be truly sound. In spite of this 
concern, research on techniques seeking to at least partially answer undecidable questions has a long history. 
However, little attention has been given to the more empirical question of how often an exact solution might 
be given to a question despite the question being, at least in theory, undecidable. This paper investigates this 
issue by exploring sub-Turing islands — regions of code for which a question of interest is decidable. We define 
such islands and then consider how to identify them. We implemented Endeavour, a prototype for finding 
sub-Turing islands and applied it to a corpus of 1100 Android applications, containing over 2 million methods. 
Results reveal that 55% of all methods are sub-Turing. Our results also provide empirical, scientific evidence 
for the scalability of sub-Turing island identification. 

Sub-Turing identification has many downstream applications, because islands are so amenable to static 
analysis. We illustrate two downstream uses of the analysis. In the first, we found that over 37% of the 
verification conditions associated with runtime exceptions fell within sub-Turing islands and thus are statically 
decidable. A second use of our analysis is during code review where it provides guidance to developers. 
The sub-Turing islands from our study turn out to contain significantly fewer bugs than “the swamp” (non 
sub-Turing methods). The greater bug density in the swamp is unsurprising; the fact that bugs remain prevalent 
in islands is, however, surprising: these are bugs whose repair can be fully automated. 
1 INTRODUCTION 


This paper seeks to answer the following fundamental question at the intersection of programming 
languages theory and empirical software engineering: 


What portion of the code of a large corpus of real software systems lies in Sub-Turing islands;
‘islands’ of code that denote computation for which interesting program analysis
questions are decidable?


We use the term ‘Turing Swamp’ to refer to any code that does not lie in such a Sub-Turing island. 
Of course, merely determining whether or not code lies within an island or in the swamp is, itself, 
undecidable. Therefore, our tool uses a simple conservative under-approximation of Sub-Turing 
islands (and corresponding over-approximation of the swamp). 
Our Sub-Turing island identification algorithm, Cook, guarantees that the halting problem is 
decidable for any computation it identifies as lying within an island; as a result, Cook necessarily 
under-approximates the amount of code that lies within such islands. Even with this relatively 
simple under-approximation, we were able to determine that a large proportion of non-trivial 
production code (for Android) does indeed lie in island (not swamp) code. That is, for a corpus of 
1100 Android applications, containing over 2 million methods, we found that 55% of the methods 
are sub-Turing. 

Even if we remove the ‘long tail’ of simple methods (like getters and setters and methods with 
fewer than 30 bytecode instructions) we still find that 22% of all code lies in a Sub-Turing island. 
We then ask 


Since we find that at least one fifth of non-trivial real-world systems lies in a Sub-Turing 
island, what are some of the ramifications for programming languages and software 
engineering applications that rely on static analysis? 


To investigate these implications we conducted two empirical studies of the impact of sub-Turing
islands. Even with our conservative under-approximation, we found that (at least) 37% of 
the verification conditions for runtime exceptions (e.g., array bounds and null pointer violations) lie 
within sub-Turing islands. Furthermore, (for a dataset of ten open source applications), we found a 
statistically significant difference in bug density, with a large effect size. 

These findings reveal a glimpse of the potential implications and applications of Sub-Turing 
analysis. In a single paper we cannot claim to have addressed more than the first few natural 
questions that occur when considering the approximate computation of the boundary between 
Sub-Turing islands and the swamp. Nevertheless, we believe that our results demonstrate that a 
surprisingly large portion of code does clearly lie within a Sub-Turing island and that there is 
practical merit in studying islands to inform and improve static analysis. Sub-Turing Islands support 
fully automatic and precise symbolic reasoning; this reasoning might be exploited for bug repair 
and free humans to concentrate on problems occurring in the swamp. 

There are many avenues for future work. We outline some of these and their relationship to 
existing trends of intellectual investigation in the programming languages and software engineering 
research communities. We hope that this paper will stimulate the further investigation of Sub-Turing 
analyses of software and real-world applications of these findings. Our paper seeks to motivate 
this research agenda with scientific evidence for the prevalence of Sub-Turing islands (within 
Android applications in this case) and the real-world impact and implications for bug density and 
verification. 

Specifically, the contributions of this paper are the following. 


• We introduce and formalise the concept of sub-Turing island. 

• We provide an analysis for identifying sub-Turing islands and its implementation in the 
prototype tool Endeavour. 

• We reveal that Sub-Turing code is more prevalent in real-world systems than might be expected: 
a conservative lower bound is that at least one fifth of non-trivial Android App code is Sub-Turing. 

• We demonstrate that Sub-Turing island analysis has great potential for real-world application. 
Specifically, we report that 37% of the array bounds and null pointer verification conditions 
lie within islands, and that islands enjoy lower bug density than the Turing swamp. 


2 SUB-TURING ISLANDS 


This section first defines Sub-Turing Island where the definition is parameterized by a decision 
procedure. As an illustrative decision procedure, we consider terminating islands, Sub-Turing islands 
with a conservative decision procedure for halts versus may not halt. The bulk of this section then 
presents the syntax and semantics of Carib, a core language that facilitates the identification of 
Sub-Turing islands. 


Definition 2.1 (Sub-Turing Island). A region of code r is sub-Turing with respect to property p if 
there exists a decision procedure D(p, r) that determines whether p holds over all executions of r. 


Under Rice’s theorem [50], Sub-Turing islands are only computable when D approximates $\neg p$ 
while being certain when it determines p holds; it cannot decide both p and $\neg p$. For an arbitrary 
property, a suitable decision procedure may not exist, hence the existential quantification in 
the definition of sub-Turing. Finding a decision procedure is based on human ingenuity. The 
parameterisation of the definition on decision procedure D implies that sub-Turing islands are only 
defined with respect to a given decidable property. Different static analyses can safely approximate 
the islands, and with different levels of precision, thereby giving rise to the generation of different 
islands. However, if any approximation safely under-approximates the code that lies within an 
island then it will be safe to make ‘island-aware assertions and inferences’ within any given island. 

We focus our investigation on a decision procedure for the halting problem. Our realisation, 
given in Section 3.2, soundly approximates this undecidable problem. Given this decision procedure, 
we frame the Terminating Islands Identification Problem in terms of states, $\sigma : L \to V_\bot$, where $L$ 
denotes the set of program l-values and $V_\bot$ denotes the lifted value domain $V$, which we leave 
otherwise unspecified. We say that a state $\sigma$ is divergence free if none of its l-values are mapped 
to $\bot$. Given a divergence-free state $\sigma$ and a region of code $r$, our goal is to determine if $[\![r]\!]^{c}\sigma$ is 
also divergence-free; we formalize $[\![r]\!]^{c}$, the semantics of Carib, in Section 2.1. From a practical 
perspective, we are interested in regions that represent meaningful code fragments such as a 
method or a loop body. 


Definition 2.2 (The Terminating Island Identification Problem). Given a region of code r, the 
Terminating Island Identification Problem is to determine if 


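\[ \forall \sigma \in \Sigma.\ \bigl(\forall x \in L.\ \sigma[x] \neq \bot \;\Rightarrow\; \forall x \in L.\ ([\![r]\!]^{c}\sigma)[x] \neq \bot\bigr). \]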


When this condition is satisfied, divergence can only result from the code that makes up r. In this 
sense, r is independent of its context (the rest of the program). Since we consider only a decision 
procedure for termination in this paper, we use Sub-Turing Island to refer to a Terminating Island 
in what follows. 

To illustrate the goal of our analysis, consider the three examples shown in Figure 1 where 
the region of code considered is a method. To begin with, method foo of Figure 1a calls method 
bar, which is clearly sub-Turing; thus it is not a source of divergence to its caller foo, which is 
also sub-Turing. In contrast, in Figure 1b, callee bar is not sub-Turing as it contains a loop whose 
termination cannot be guaranteed. As a result, its caller foo is also not sub-Turing. Finally, Figure 1c 
is similar to Figure 1b except that the source of divergence is a call to an API method, which may 
diverge. (A call to a recursive method would have the same effect.) The potentially divergent API call 
does not, however, relegate Line 4 to the swamp, despite its control dependence on the call to bar. 
This is because a Carib function always terminates, so the fact that Line 4 is termination-sensitive 
control dependent on Line 3 does not matter [4, 20]. 


2.1 Carib: Its Syntax and Semantics 


Figure 2 defines G, the grammar of Carib, our core language. Carib is a Jimple-like [63] intermediary 
representation with a minimal set of instructions. Vallée et al. [63] and Bartel et al. [8] have shown 
that Jimple can encode the entire instruction set of widely deployed virtual machines, such as 
the JVM and Dalvik. A Carib program is a set of methods derived from the nonterminal PROG. 
Carib incorporates three simplifications that ease the presentation. First, instead of the usual 
(a)
 1  void foo() {            ✓
 2    int x = 5;
 3    x = bar();
 4    return x;
 5  }
 6
 7  int bar() {             ✓
 8    int y = 1;
 9    return y;
10  }

(b)
 1  void foo() {            ✗
 2    int x = 5;
 3    x = bar();
 4    return x;
 5  }
 6
 7  int bar() {             ✗
 8    int y = 0;
 9    while (C) {
10      y = y + 1;
11    }
12    return y;
13  }

(c)
 1  void foo() {            ✓
 2    int x = 5;
 3    unused = bar();
 4    return x;
 5  }
 6
 7  int bar() {             ✗
 8    int y = 0;
 9    r = api();
10    if (r) {
11      y = y + 1;
12    }
13    return y;
14  }


Fig. 1. Three examples illustrating the outcome of our Sub-Turing analysis. Symbol ✓ indicates that the 
method is sub-Turing and ✗ indicates the opposite. In the example, C is a condition that cannot be proven to 
eventually be false while api is a call to an unknown API method. 


PROG   ::= METHOD, ... , METHOD 
METHOD ::= id(id, ... , id) STMT 
STMT   ::= STMT; STMT | ASSIGN | IFE | WHILE | CALL | return id 
ASSIGN ::= id := c | id := id | id := op id | id := id op id 
         | id := id.id | id.id := id | id := id[id] | id[id] := id 
IFE    ::= if COND then STMT else STMT 
WHILE  ::= while COND do STMT 
CALL   ::= id := id(id, ... , id) 
COND   ::= id rel_op id 


Fig. 2. The grammar G for Carib, our core Jimple-like language. 


syntax for method invocation o.m(...), Carib uses the form m(o, ...), with the receiver being 
the first argument. Second, Carib defines only the two structured control constructs while and 
if-else. Finally, to simplify reasoning about side-effects, Carib restricts pointer dereferences to its 
assignment statement, where the dereference operator can only appear either in the LHS or RHS 
alone. Further, the call syntax only permits an id as an actual parameter, again ruling out pointer 
dereferences. These properties simplify reasoning about aliasing in Carib. 

Carib’s semantics, $[\![\cdot]\!]^{c}$, extends a conventional semantics, $[\![\cdot]\!]$, such as Winskel’s IMP [68] where 
$[\![s]\!]$ is a partial function from $\Sigma$ to $\Sigma$ that updates the state to reflect the execution of $s$. For $s \in L(G)$ 
(Figure 2), $[\![s]\!]^{c}$ is identical to the conventional semantics, $[\![s]\!]$, when $s$ terminates. When $s$ does not 
terminate, $[\![s]\!]^{c}$ reifies the nontermination by binding $\bot$ to each variable modified by $s$. 

To reify nontermination, $[\![s]\!]^{c}$ must first identify it. In Carib there are three potential sources: 
loops, recursive method calls, and (unknown) API calls. The semantics uses three oracles in the 
$w(s)$      set of all l-values potentially written by the execution of $s$ 
$O_t(s)$    the termination oracle used to check for nontermination of $s$ 
$A$         the set of (assumed divergent) API methods 
$R$         the set of recursive methods $O_t$ deems divergent 
$fv(e)$     the set of free variables in expression $e$ 
ret         a fresh pseudo-variable used to hold a method’s return value 


Fig. 3. Symbols and functions used to define Carib’s semantics in Figure 4; the first three are parameters. 


identification: $O_t$, $w$, and $A$ (Section 3 discusses our computable approximations to these three). For 
example, the termination oracle $O_t$ is used to identify nonterminating loops as well as recursive 
methods that may diverge. Figure 3 defines these oracles and other symbols and functions used to 
define Carib’s semantics. 

Finally, we formalise the notions of state and state update as used in the Carib semantics. As a 
convenience, we assign a unique name to each local variable and formal parameter, and then simply 
refer to only those names that are in scope. We use $L$ to denote the set of all program l-values (in 
Figure 2, these include identifiers id and array/structure references, id[id]). An l-value denotes a 
memory location that holds a value from the lifted value domain, $V_\bot$, where $\bot$ denotes divergence; 
we leave $V$ otherwise unspecified. A program state $\sigma : L \to V_\bot$ maps each l-value $x$ to its value. 
$\Sigma$ denotes the (possibly infinite) set of all program states. We write $\sigma[x]$ to denote the value that $\sigma$ 
maps $x$ to and $\sigma[x := v]$ to denote the updated state $\sigma'$ where $\sigma'[x] = v$ and $\sigma'[y] = \sigma[y]$ for $x \neq y$. As 
a notational convenience, we write $\sigma[X]$ for variable set $X$ to denote $\{\sigma[x] \mid x \in X\}$ and $\sigma[Y := Z]$ 
as shorthand for $\sigma[y_1 := \sigma[z_1], \cdots, y_k := \sigma[z_k]]$, with $Y = (y_1, \cdots, y_k)$ and $Z = (z_1, \cdots, z_k)$ and 
$y_i, z_i \in L$. Finally, we write $\sigma[Y := v]$ to denote $\sigma[y_1 := v, \cdots, y_k := v]$. 

Carib’s semantics must account for two potentially divergent constructs, WHILE and CALL. If a 
loop does not terminate, our semantics effectively replaces the loop with a parallel assignment 
of $\bot$ to all the l-values that the loop may modify. We handle recursive calls in the same way. 
Other constructs in $L(G)$ may propagate $\bot$, but will not introduce it. In essence, our semantics is a 
collecting semantics based on taint analysis where WHILE and CALL are the only taint sources. We 
say that a program point is in the swamp if $\bot$ reaches it; otherwise the point is in a sub-Turing island. 
Finally, we emphasise that, under its semantics, Carib methods always terminate. 

Figure 4 presents Carib’s semantics, $[\![\cdot]\!]^{c}$. The rule for WHILE leverages $O_t$ to determine if a loop 
terminates. For a loop $s$ that may not terminate, the first line of the WHILE rule binds $\bot$ to each l-value 
potentially written during the execution of $s$. The externally supplied function $w(s)$ identifies 
these l-values. Section 3 describes our conservative realisation of $O_t$ as well as our conservative 
determination of the set of written l-values. 

The second source of non-termination is calls to recursive methods and API methods, denoted $R$ 
and $A$ in Figure 4. For the CALL rule, the first case binds $\bot$ to all l-values that the called method’s 
execution potentially updates together with $r$, the variable receiving the method’s return value. 
In CALL’s second case, conventional semantics apply. Here, body(m) denotes the statements of 
the called method. Working outward from $[\![body(m)]\!]^{c}$, we evaluate m’s body on the state formed by 
binding m’s formals to the actuals found in the call. Other than the return value, there is no need to 
map information back to the caller because $L(G)$ uses call by value semantics. In the case of objects 
(and arrays), Carib passes a copy of the reference to the object to the callee, thereby allowing the 
called method to update (only) the members of the class (or array) associated with the actual. To 
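\[ \text{SEQUENCE:}\quad [\![s_1; s_2]\!]^{c} = \lambda\sigma.\ [\![s_2]\!]^{c}\,([\![s_1]\!]^{c}\,\sigma) \]

\[ \text{ASSIGN:}\quad [\![s = v := e]\!]^{c} =
\begin{cases}
\lambda\sigma.\ \sigma[v := \bot] & \text{if } \bot \in \sigma[fv(e)] \\
[\![s]\!] & \text{otherwise}
\end{cases} \]

\[ \text{IFE:}\quad [\![s = \text{if } e \text{ then } s_1 \text{ else } s_2]\!]^{c} =
\begin{cases}
\lambda\sigma.\ \sigma[w(s) := \bot] & \text{if } \bot \in \sigma[fv(e)] \\
\lambda\sigma.\ [\![e]\!]\sigma \;?\; [\![s_1]\!]^{c}\sigma : [\![s_2]\!]^{c}\sigma & \text{otherwise}
\end{cases} \]

\[ \text{WHILE:}\quad [\![s = \text{while } e \text{ do } s_1]\!]^{c} =
\begin{cases}
\lambda\sigma.\ \sigma[w(s) := \bot] & \text{if } \neg O_t(s) \lor \bot \in \sigma[fv(e)] \\
\lambda\sigma.\ [\![e]\!]\sigma \;?\; [\![s_1; \text{while } e \text{ do } s_1]\!]^{c}\sigma : \sigma & \text{otherwise}
\end{cases} \]

\[ \text{CALL:}\quad [\![s = r := m(X)]\!]^{c} =
\begin{cases}
\lambda\sigma.\ \sigma[w(s) := \bot,\ r := \bot] & \text{if } m \in A \cup R \\
\lambda\sigma.\ \text{let } m \text{ be } m(Y),\ \sigma' = [\![body(m)]\!]^{c}\,\sigma[Y := X] \text{ in } \sigma[r := \sigma'[ret]] & \text{otherwise}
\end{cases} \]

where body(m) denotes the body of method m. 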


Fig. 4. Carib’s denotational semantics: WHILE and CALL can introduce divergence (L); the other equations 
only propagate it. 


handle return values, we store the value in o’[ret] and then update the final state to bind r to this 
value. 

Finally, the IFE and ASSIGN rules can only propagate $\bot$; they do not introduce it. For ASSIGN, if 
$\bot$ reaches any variables in $e$, ASSIGN binds $\bot$ to $v$; otherwise, the conventional semantics $[\![s]\!]$ is 
used. When $\bot$ reaches an if statement’s conditional expression, the IFE rule assigns $\bot$ to all l-values 
potentially written by either branch of the if statement; otherwise, it applies $[\![\cdot]\!]^{c}$ to the appropriate 
branch. 
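The propagation behaviour of these rules can be illustrated with a small interpreter. The following is a minimal sketch, not the paper's implementation: it assumes a tuple-based statement encoding, models only "+" and "<" expressions, omits CALL, and takes the termination oracle as a caller-supplied function named `terminates` (all of these names are hypothetical).

```python
BOT = object()  # sentinel standing in for the divergence value (bottom)


def fv(expr):
    """Free variables of an expression: a name, a constant, or (op, left, right)."""
    if isinstance(expr, str):
        return {expr}
    if isinstance(expr, tuple):
        return fv(expr[1]) | fv(expr[2])
    return set()


def ev(expr, state):
    """Conventional evaluation; only called when no free variable is BOT."""
    if isinstance(expr, str):
        return state[expr]
    if isinstance(expr, tuple):
        op, left, right = expr
        a, b = ev(left, state), ev(right, state)
        return a + b if op == "+" else a < b  # only "+" and "<" are modelled
    return expr


def writes(stmt):
    """Syntactic stand-in for the write oracle w(s)."""
    kind = stmt[0]
    if kind == "assign":
        return {stmt[1]}
    if kind == "seq":
        return writes(stmt[1]) | writes(stmt[2])
    if kind == "if":
        return writes(stmt[2]) | writes(stmt[3])
    if kind == "while":
        return writes(stmt[2])
    raise ValueError(kind)


def reaches_bot(expr, state):
    return any(state[x] is BOT for x in fv(expr))


def taint(state, names):
    """Parallel assignment of BOT to every name in `names`."""
    return {**state, **{n: BOT for n in names}}


def run(stmt, state, terminates):
    kind = stmt[0]
    if kind == "seq":
        return run(stmt[2], run(stmt[1], state, terminates), terminates)
    if kind == "assign":
        _, var, expr = stmt
        if reaches_bot(expr, state):
            return taint(state, {var})
        return {**state, var: ev(expr, state)}
    if kind == "if":
        _, cond, then_s, else_s = stmt
        if reaches_bot(cond, state):
            return taint(state, writes(stmt))
        return run(then_s if ev(cond, state) else else_s, state, terminates)
    if kind == "while":
        _, cond, body = stmt
        while True:
            # Divergence is introduced by the oracle or propagated via the guard.
            if not terminates(stmt) or reaches_bot(cond, state):
                return taint(state, writes(stmt))
            if not ev(cond, state):
                return state
            state = run(body, state, terminates)
    raise ValueError(kind)


# i := 0; while i < 3 do i := i + 1; y := i
prog = ("seq", ("assign", "i", 0),
        ("seq", ("while", ("<", "i", 3), ("assign", "i", ("+", "i", 1))),
         ("assign", "y", "i")))
```

Running `prog` with an oracle that accepts the loop executes it conventionally, while an oracle that rejects the loop taints `i` and, through the final assignment, `y`; variables the loop never writes keep their values.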


3 REALISING ORACLES AND TRANSLATING TO CARIB 


To analyse industrial programs, we must translate them into Carib and instantiate Carib’s three 
oracles: its writable location oracle $w$, its termination oracle $O_t$, and its externally defined set of 
divergent API methods $A$. To handle constructs Carib does not define, we translate to Carib in the 
obvious way, conceptually de-sugaring them. For $A$, we assume a user-supplied list. Below, we 
describe how we realize $w$ and $O_t$ for Android bytecode. Realising these oracles is not enough. To 
apply the Cook analysis to actual programs, we also need to change the semantics of their potentially 
divergent constructs, like loops and method calls. We achieve this via a program transformation $\phi$ 
that replaces each potentially divergent construct $c$ with parallel assignments of $\bot$ to the l-values 
of $w(c)$. 


3.1 Soundly Identifying Potential Writes 


Realising w requires finding writable locations, both syntactic l-values and what can be reached 
through them. We currently harvest syntactic l-values from assignment statements without consid- 
ering their feasibility. Computing reachable l-values would require handling aliasing, which occurs 
when two l-values refer to the same object. To account for the possibility of aliasing in the analysis 
we use l-value representatives where two l-values are aliases if they have the same representative. 
We want our Cook analysis (Section 4) to scale to large applications, so we need a sound 
and efficient handling of aliases. Instead of performing pointer analysis (Section 6.4), we use 
Sundaresan et al.’s approach [59] as it offers a simple and scalable solution. Their approach is 
based on the observation that only objects of compatible types can be aliases in a type-safe 
programming language, such as Java. Another advantage of this approach is its simplicity and ease 
of implementation. Sundaresan et al. use a flow-insensitive analysis because their main purpose is to 
keep track of object types. In our case, we need to track value transfer between variables. Therefore, 
we take inter-variable flow relations into account. For example, in the code x := y; y := z, a 
flow-insensitive analysis captures data-flow from y to x and z to y and the spurious flow from z 
to x. Our approach does not include the spurious flow. Despite this additional precision, we still 
benefit, in terms of scalability, from Sundaresan et al.’s alias handling. 
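The difference between the two treatments of x := y; y := z can be made concrete. The sketch below is an illustration only, assuming straight-line copy assignments encoded as (destination, source) pairs; the function names are hypothetical and this is not the analysis used in the paper.

```python
def flow_insensitive(assignments):
    """Transitive closure of copy edges, ignoring statement order."""
    edges = {(src, dst) for dst, src in assignments}
    changed = True
    while changed:
        changed = False
        for a, b in list(edges):
            for c, d in list(edges):
                if b == c and (a, d) not in edges:
                    edges.add((a, d))
                    changed = True
    return edges


def flow_sensitive(assignments):
    """Process copies in program order, tracking whose initial value each variable holds."""
    origin = {}   # variable -> set of variables whose initial value it may hold
    flows = set()
    for dst, src in assignments:
        sources = set(origin.get(src, {src}))
        origin[dst] = sources
        flows |= {(s, dst) for s in sources}
    return flows


code = [("x", "y"), ("y", "z")]   # x := y; y := z
insensitive = flow_insensitive(code)
sensitive = flow_sensitive(code)
```

Here `insensitive` contains the pair `("z", "x")`, the spurious flow described above, whereas `sensitive` contains only the flows from y to x and from z to y.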

Let $L_R$ denote $L$ augmented with abstracted locations created from the types of the subject 
program. Formally, we define $R : L \to L_R$ to map each l-value to its representative: 

\[ R(l) =
\begin{cases}
T(o, f).f & \text{for field reference } l := o.f \\
R(a) & \text{for array reference } l := a[i] \\
l & \text{otherwise (i.e., scalar references)}
\end{cases} \]


where o is an object of type t, f is a field, a is an array, i is an index, and T (o, f) denotes the highest 
class in the type hierarchy of t that contains the field f. 

In Carib (Section 2), an l-value is either a variable, an array, or a field access. In the absence 
of an array or field dereference, the mapping is simply R(x) = x. The other two cases, a field or 
array access, are more involved. Two field accesses $o_1[f]$ and $o_2[f]$ are aliases if $o_1$ and $o_2$ point 
to the same object. To handle this case, all potentially aliasing field accesses must have the same 
representative. In type safe languages, like Java, $o_1[f]$ and $o_2[f]$ can be aliases only if $o_1$ and $o_2$ 
belong to the same type hierarchy. While in principle if $o_1[f]$ and $o_2[f]$ are aliases, then either 
suffices as the representative, for ease of identification, we include in $R$’s range representatives 
based on type names, specifically $T(o, f)$. 

An array access l-value can alias for two reasons: reference and index. In the reference case, a[i] 
and b[i] alias if a and b alias. Alternatively, a[i] and a[j] alias if i = j. To take both into account, 
we perform a lightweight alias analysis that partitions array terms into parts of potential aliases. 
Given an array a, A(a) returns the representative of a’s alias part. Defining the representative of 
a[i] as A(a)[i] solves the problem of reference-induced aliasing, but not index-induced aliasing. 
Tracking indexes may generate an unbounded number of terms when indices are modified inside a 
loop. Therefore, R’s third case conservatively assumes all indices alias. 

Using $R$ and $lvals(s)$, the set of all syntactic l-values in $s$, we approximate $w$ as 

\[ \hat{w}(s) = \bigcup_{x \in lvals(s)} R(x). \]

This realisation of $w$, $\hat{w}$, assumes $lvals$ can access $s$’s internals. This assumption does not hold, in 
general, for API calls whose implementation can be externally defined in a black box library. Such 
API calls are prevalent in real world code. To handle them, we use a second instantiation of w that 
we describe in Section 3.3 where we first use it. 
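The representative mapping and the resulting write approximation can be sketched as follows. This is a minimal sketch under stated assumptions: l-values are encoded as tuples, the class hierarchy is given as plain dictionaries, array alias parts come from a hypothetical `array_part` table, and representatives are spelled `Class.field` or `part[*]`. None of these names come from the paper.

```python
def highest_declaring_class(cls, field, superclass, declared_fields):
    """T(o, f): the highest class in cls's ancestor chain that declares `field`."""
    best = None
    while cls is not None:
        if field in declared_fields.get(cls, set()):
            best = cls
        cls = superclass.get(cls)
    return best


def representative(lval, env):
    """Map an l-value to its representative, collapsing potential aliases."""
    kind = lval[0]
    if kind == "field":
        _, obj, field = lval
        owner = highest_declaring_class(
            env["var_type"][obj], field, env["superclass"], env["declared_fields"])
        return f"{owner}.{field}"
    if kind == "array":
        _, arr, _index = lval
        # All indices are assumed to alias, so the index is dropped.
        return f"{env['array_part'].get(arr, arr)}[*]"
    return lval[1]  # scalar variable: its own representative


def w_hat(lvals, env):
    """Union of the representatives of all syntactic l-values of a statement."""
    return {representative(lv, env) for lv in lvals}


env = {
    "var_type": {"p": "Derived", "q": "Base"},
    "superclass": {"Derived": "Base"},
    "declared_fields": {"Base": {"f"}, "Derived": set()},
    "array_part": {"a": "a", "b": "a"},   # a and b may refer to the same array
}
written = w_hat(
    [("field", "p", "f"), ("field", "q", "f"),
     ("array", "a", "i"), ("array", "b", "j"), ("var", "x")],
    env,
)
```

In this example the two field accesses share one representative because both receivers belong to the same type hierarchy, and the two array accesses share one because their arrays fall in the same alias part.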


3.2 Identifying Divergent Loops and Calls 


Realising Carib’s semantics for real bytecode demands that we first identify loops. Since bytecode 
permits unstructured loops, we implemented a loop detection analysis that searches for loops in 
the control flow graph of each method, which was experimentally shown to outperform existing 
alternatives [66]. This analysis is an optimised depth first traversal that discovers the control flow 
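\[
\begin{array}{lll}
\text{LOOP:} & s = \text{while } c \text{ do } s_1 \;\land\; \neg O_t(s) & \Rightarrow\ s \mapsto R(Y) := \bot \quad \text{where } Y = w(s) \\[4pt]
\text{REC:} & m \in R \;\land\; s = r := m(X) & \Rightarrow\ s \mapsto R(Y) := \bot \quad \text{where } Y = w(body(m)[X/F]) \\[4pt]
\text{API:} & m \in A \;\land\; s = r := m(X) & \Rightarrow\ s \mapsto R(Y) := \bot;\ R(r) := \bot \quad \text{where } Y = \bigcup_{x_i \in X} RLV(x_i)
\end{array}
\]


Fig. 5. The program transformations $\phi$ used to convert divergent constructs into parallel assignments of $\bot$; $F$ 
used in REC is m’s formals. 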


graph. This analysis also allows us to detect both simple and complex loops including nested loops 
and those constructed using gotos. A purely syntactic method, this analysis is complete (finds all 
loops), but unsound, in that it may report infeasible loops. 

Having identified loops, we turn to realizing $O_t$. Despite the undecidability of loop termination 
in general, we can statically determine that some loop forms terminate. While simple, our oracle 
realisation is not purely syntactic. It simplifies a loop before checking its form. Let $w$ be a loop 
that does not contain any nested loops, but otherwise has an arbitrary body. The execution of 
loop $w$ can be expressed as a sequence of single-iteration cycles: $C_1, \ldots, C_k$. Our oracle concludes 
that $w$ terminates iff both of the following hold: 1) each cycle increments a counter and 2) this 
counter is bounded in each cycle. These two conditions guarantee the existence of an increasing 
ranking function that is bounded from above, which is sufficient to ensure loop termination [43]. 
This instantiation of $O_t$, which we denote $\hat{O}_t$, safely and conservatively under-approximates the 
set of terminating loops. 
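The two conditions can be sketched as a check over a pre-digested loop summary. This is an illustration only: it assumes each cycle has already been summarised as a dictionary with the hypothetical keys `increments` (counter to constant step) and `bounded_above` (counters the cycle's guard compares against a loop-invariant upper bound), and it conservatively requires the same counter in every cycle. A `False` result means "termination not established", not "diverges".

```python
def loop_terminates(cycles):
    """Return True only if one counter is incremented and bounded above in every cycle."""
    if not cycles:
        return False  # nothing known about the loop: stay conservative
    candidates = None
    for cycle in cycles:
        good = {
            counter
            for counter, step in cycle["increments"].items()
            if step > 0 and counter in cycle["bounded_above"]
        }
        candidates = good if candidates is None else candidates & good
    return bool(candidates)


# for (i = 0; i < n; ...) with two paths through the body, both advancing i
counted_loop = [
    {"increments": {"i": 1}, "bounded_above": {"i"}},
    {"increments": {"i": 2, "j": 1}, "bounded_above": {"i"}},
]
# one path through the body leaves the counter untouched
stalling_loop = [
    {"increments": {"i": 1}, "bounded_above": {"i"}},
    {"increments": {}, "bounded_above": {"i"}},
]
```

Under these assumptions `loop_terminates(counted_loop)` holds while `loop_terminates(stalling_loop)` does not, which mirrors the under-approximating nature of the oracle.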

Our corpus of Android apps includes 627,423 loops. Of these, our oracle shows that 330,894 (53%) 
terminate. Despite its simplicity and conservatism, $\hat{O}_t$ identifies a large number of terminating 
loops. 

In Section 2.1, we use $O_t$ to populate $R$, the set of divergent recursive methods. $\hat{O}_t$ works only 
on loops, which necessitates a separate mechanism for $R$. To do so, we conservatively build a 
call graph for the program using the class hierarchy approach [59] that provides a conservative 
approximation of the runtime types of receiver objects. We then identify recursive methods as 
those nodes belonging to strongly connected components in the call graph. To this end, we use 
Tarjan’s algorithm for detecting strongly connected components [60]. 
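The recursion check can be sketched with a textbook rendering of Tarjan's algorithm. The sketch assumes the call graph is a dictionary from caller to callees, and it makes one assumption explicit that the text leaves implicit: a single-node component counts as recursive only if the method calls itself. It is recursive Python, so very deep call chains would need an iterative variant.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm over a dict mapping each node to its successors."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def recursive_methods(call_graph):
    """Methods on a call-graph cycle, including direct self-calls."""
    recursive = set()
    for component in strongly_connected_components(call_graph):
        if len(component) > 1:
            recursive |= component
        else:
            (method,) = component
            if method in call_graph.get(method, ()):
                recursive.add(method)
    return recursive


call_graph = {"main": ["f", "g"], "f": ["g"], "g": ["f"], "h": ["h"], "leaf": []}
```

For this graph, `f` and `g` are mutually recursive and `h` is directly recursive, while `main` and `leaf` are not.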


3.3 Rewriting Divergent Constructs 


To track divergence in a program, we identify divergent constructs and replace them with assignments 
of $\bot$ to every l-value representative potentially modified. We do this as a source-to-source 
transformation $\phi : L(G) \to L(G)$, which maps each statement $s$ to a statement $s'$ that explicitly 
includes all necessary assignments of $\bot$. As Figure 5 depicts, we define $\phi$ using the three rewriting 
rules: LOOP, REC, and API. The LOOP rule replaces the loop with a parallel assignment of $\bot$ to all 
l-value representatives that the loop modifies. It uses $w$ to identify these l-values. The REC rule does 
the same for calls to recursive methods. 

Handling loops and recursion makes us complete with respect to our language Carib (Section 2), 
assuming that a program is self-contained (i.e. does not make API calls). Most programs, however, 
make API calls to external libraries. Because an external library is a black box, we cannot syntacti- 
cally determine which of its parameters it may write through. Thus, we need a second instantiation 
of w to handle API calls. This second instantiation determines all the l-value representatives poten- 
tially modified using a given formal parameter. For arrays or objects, however, we must consider 
class A { B b; }
class B { int x; }

void m(A a)
{
    a.b = ...;
    B tmp = a.b;
    tmp.x = ...;
}


Fig. 6. A Java example illustrating the l-values reachable from the formal parameter a. 


their fields. Let ‘fields’ be the set of fields of a variable where the empty set denotes a scalar. Given 
the formal parameter $x$, its reachable l-values are the set of l-value representatives given by the 
function RLV defined as follows 

\[ RLV(x) = \{ R(x.f) \mid f \in fields(x) \} \;\cup \bigcup_{f \in fields(x)} RLV(x.f). \]

RLV is irreflexive because, under Carib’s call by value semantics, actual parameters are immutable. In 
Figure 5, the API rule uses RLV to handle API calls where the set $Y$ includes all l-value representatives 
reachable from m’s actual parameters, as determined by RLV. 

As an illustration of RLV, consider the code shown in Figure 6. To simplify the presentation, the 
code examples in the paper use a more common Java-like syntax. The actual analysis is applied 
to bytecode, which is closer to Carib’s Jimple-like syntax of Figure 2; however, the more Java-like 
syntax better communicates the intuition behind our technique. In Figure 6, method m has a single 
formal parameter a and accesses its b field and subsequently b’s field x through the variable tmp. 
Thus, RLV(a) = {R(a.b), R(tmp.x)}, which, assuming A has no super classes, yields {A.b, A.B.x}. 

Our transformation $\phi$ applies the rules LOOP, REC, and API. We naturally extend $\phi$ to an entire 
program $P$, where each divergent construct $s$ in $P$ is replaced with $\phi(s)$ in $P'$. 
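The reachable l-value computation for API calls can be sketched over declared types. This is a minimal sketch with several assumptions: types are described by a hypothetical `fields_of` table, classes have no superclasses (as in the Figure 6 discussion), representatives are spelled `Class.field` (which may differ from the paper's own spelling), and a visited set is added so that self-referential types do not cause infinite recursion.

```python
def rlv(param_type, fields_of, _seen=None):
    """Representatives reachable through the fields of a parameter of `param_type`."""
    seen = set() if _seen is None else _seen
    if param_type in seen:
        return set()          # type already expanded (e.g. a linked-list node)
    seen.add(param_type)
    reps = set()
    for field, field_type in fields_of.get(param_type, {}).items():
        reps.add(f"{param_type}.{field}")
        reps |= rlv(field_type, fields_of, seen)
    return reps               # scalars have no fields, so they contribute nothing


def api_call_targets(result_var, arg_types, fields_of):
    """L-values to taint for r := m(x1, ..., xn) when m is an assumed-divergent API method."""
    targets = {result_var}
    for arg_type in arg_types:
        targets |= rlv(arg_type, fields_of)
    return targets


# The classes of Figure 6: class A { B b; }  class B { int x; }
fields_of = {"A": {"b": "B"}, "B": {"x": "int"}}
reachable_from_a = rlv("A", fields_of)
```

For the Figure 6 classes, the sketch reports the field b of A and the field x of B as reachable from a parameter of type A; the parameter itself is not included, matching the irreflexivity noted above.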


4 COOK: DISCOVERING SUB-TURING ISLANDS 


This section introduces our analysis algorithm Cook, named after the British explorer Captain 
Cook. Starting from the basic knowledge that divergent constructs are clearly part of the swamp, 
we want to analyze their impact on other parts of the program. Let us write $\ell_1{:}x$ to refer to variable 
$x$ at program location $\ell_1$ (e.g., at a given line number). We also write $depend(\ell_2{:}y, \ell_1{:}x, s)$ to indicate 
that in the scope of statement $s$, variable $y$ at location $\ell_2$ depends on variable $x$ at location $\ell_1$. In 
other words, modifying $x$ at $\ell_1$ may modify $y$ at $\ell_2$. We over-approximate $[\![\cdot]\!]^{c}$ (Figure 4) with respect 
to divergence propagation via the following rule: 


\[ \frac{\ell_1{:}x = \bot \qquad depend(\ell_2{:}y,\ \ell_1{:}x,\ s)}{\ell_2{:}y = \bot} \quad \text{(DIVERGENCE PROPAGATION)} \]
This suggests an approach for identifying sub-Turing islands by applying a dependency analysis 
whose goal is to assign L to variables that depend on other divergence-affected variables. Broadly 
speaking, our analysis approximates the dependency relation induced by the program over variables, 
yielding an over-approximation of the swamp. Methods not in the swamp make up the sub-Turing 
islands, which we thus conservatively under-approximate. 
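Applied to a fixed dependency relation, the propagation rule amounts to a reachability computation. The sketch below assumes program points are encoded as (location, variable) pairs and that the `depends` relation is already available as (dependent, source) pairs; both encodings and the function name are illustrative assumptions.

```python
from collections import defaultdict, deque


def propagate_divergence(seeds, depends):
    """Close the set of divergent points under the divergence propagation rule.

    seeds   -- points already bound to bottom, e.g. results of rewritten constructs
    depends -- pairs (l2:y, l1:x) meaning y at l2 depends on x at l1
    """
    successors = defaultdict(set)
    for dependent, source in depends:
        successors[source].add(dependent)

    tainted = set(seeds)
    work = deque(seeds)
    while work:
        point = work.popleft()
        for dependent in successors[point]:
            if dependent not in tainted:
                tainted.add(dependent)
                work.append(dependent)
    return tainted


# Dependencies loosely modelled on bar and foo in Figure 1c
depends = [
    ((11, "y"), (9, "r")),    # y at line 11 depends on r at line 9
    ((13, "y"), (11, "y")),   # returned y depends on the updated y
    ((4, "x"), (2, "x")),     # foo's returned x depends only on its own x
]
swamp_points = propagate_divergence({(9, "r")}, depends)
```

In this example the taint introduced at the API call reaches both uses of `y` but never reaches `x`, so points that only involve `x` stay outside the over-approximated swamp.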
Figure 7 overviews Cook’s workflow and components. Cook takes as input a transformed program 
whose divergent constructs have been rewritten. Cook outputs a report indicating which methods 
are sub-Turing and which fall into the Turing swamp. 
Fig. 7. Cook’s workflow and main components: check marks sub-Turing island methods; cross marks methods 
in the Turing swamp. 


While a sub-Turing island can be any code region, the islands we consider in the remainder of 
the paper are methods. Cook implements a bottom-up inter-procedural dependency analysis. It 
consists of two fix-point computations. The outer computation, Explore (Algorithm 1), operates 
over the whole program and calls the inner computation Landfall (Algorithm 2), to compute facts 
for methods. In what follows, we describe each algorithm in detail. 
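The interplay of the two computations can be sketched as a worklist loop. This is a rough sketch, not the authors' code: the inner computation is passed in as a hypothetical `landfall` function returning a set of (variable, value) facts, `locals_of` and `callers` are assumed lookup tables, and returning the methods outside the swamp is inferred from the stated output of Explore.

```python
BOT = object()  # sentinel standing in for the divergence value (bottom)


def explore(methods, callers, landfall, locals_of):
    """Outer fix point: re-analyse callers whenever a callee's summary changes."""
    worklist = list(methods)
    summary = {m: frozenset() for m in methods}
    swamp = set()
    while worklist:
        m = worklist.pop()
        facts = landfall(m, summary)
        if any(value is BOT for _, value in facts):
            swamp.add(m)
        # Locals are not visible to callers, so drop them from the summary.
        facts = frozenset((x, v) for x, v in facts if x not in locals_of(m))
        if facts != summary[m]:
            summary[m] = facts
            worklist.extend(callers.get(m, ()))
    return set(methods) - swamp


# Toy instance: bar contains a divergent construct and foo uses bar's result.
calls = {"foo": ["bar"]}
callers = {"bar": ["foo"]}
divergent = {"bar"}


def toy_landfall(m, summary):
    """Stand-in inner analysis: taint the return value if m or a callee is tainted."""
    facts = set()
    if m in divergent:
        facts.add(("ret", BOT))
    for callee in calls.get(m, ()):
        if ("ret", BOT) in summary[callee]:
            facts.add(("ret", BOT))
    return facts


islands = explore(["foo", "bar", "baz"], callers, toy_landfall, lambda m: set())
```

With this toy inner analysis, `bar` and then its caller `foo` fall into the swamp, and only `baz` is reported as sub-Turing. The loop terminates here because summaries only grow; a real inner analysis would need the same monotonicity.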


4.1 Explore: Interprocedurally Searching for Sub-Turing Islands 


Starting from the transformed program, Cook is an inter-procedural taint analysis that propagates 
divergence. Cook assigns a method to the swamp if it uses a tainted variable when called in a 
divergence-free state. Thus, Cook considers only taints produced by the method or a method it 
transitively calls. 

In sub-Turing analysis of termination, nested loops (and recursion) can propagate bottom outwards 
but enclosing loops cannot propagate it inwards. If non-termination were instead 
defined to propagate inwards, the analysis of islands would often be trivial and useless. 
For example, a loop-free reactive program, encased in a single non-terminating loop would often 
simply become ‘all swamp’. That would not be helpful for analysis: the body is loop free and so 
this body always terminates. It can be analysed as a terminating island of code, in isolation from its 
surrounding loop. 

In such a reactive system, figuratively speaking, the program is a single large ‘castle’ on an 
island surrounded by a ‘moat’ of swamp. Such a ‘swamp castle’ does not, itself, fall into the swamp. 
Pragmatically, this means that we could (and we argue, should) analyse and reason about the body 
of such a reactive system (which is loop free) in a very different way to the way in which we would 
reason about it as a whole component in a larger system. However, for our Cook analysis, the 
fact that taints do not propagate from the calling context means we cannot use an off-the-shelf 
solution [5]. 

Cook’s output is the set of sub-Turing methods. Cook is inter-procedural and needs the pro- 
gram’s call graph. Object-oriented languages, in general, have many features, such as method 
overriding, that make constructing an exact call graph at compile time impossible. Thus, Cook 
over-approximates the call graph using a class hierarchy approach [59] that conservatively approx- 
imates the runtime types of receiver objects. For an object o having a declared type C, its estimated 
types will be C plus all the subclasses of C. If C is an interface then its estimated types are all the 
classes implementing it and the classes derived from them. We use the notation $\sqsupseteq$ to represent the 
inheritance relation between classes (types); $C \sqsupseteq S$ means that S is a subclass of C. This relation is 
reflexive, thus $C \sqsupseteq C$. 

Algorithm 1: Explore traverses its input program’s call graph, calling Landfall on each method, 
until it reaches a fix point where no method summaries change and it exhausts its worklist. 
Input: Program P 
Output: P’s sub-Turing methods 
 1  worklist := M(P);                  // M(P) is the set of P’s methods. 
 2  Var map summary; 
 3  Var set swamp := ∅; 
 4  foreach m ∈ worklist do 
 5      summary[m] := ∅; 
 6  while worklist ≠ ∅ do 
 7      m := pop(worklist); 
 8      s’ := Landfall(P, m);          // Algorithm 2 defines Landfall. 
 9      if ∃(x, ⊥) ∈ s’ then 
10          swamp := swamp ∪ {m}; 
11      s’ := s’ − locals(m);          // Remove m’s locals from its summary. 
12      if s’ ≠ summary[m] then 
13          summary[m] := s’; 
            /* CG(P) is P’s call graph. */ 
14          foreach m’ ∈ M(P) • (m’, m) ∈ CG(P) do 
15              push(worklist, m’);    // m’s summary changed, so we update its callers. 
16  return M(P) − swamp; 

Given an object o, the function rt returns all the types that o can potentially 
have at runtime. If the declared type of o is the class C then we have 

$$rt(o) = \{C' \mid C \sqsupseteq C'\}.$$ 


Let function Impl(I) return all classes implementing interface I, including the implementations of 
subinterfaces of I. If the declared type of an object o is an interface I, then we have 

$$rt(o) = \{C \mid C' \sqsupseteq C \wedge C' \in Impl(I)\}.$$ 

This means that we take into account all the classes implementing I, the ones implementing 
subinterfaces of I, and their subclasses. For a method invocation o.m, the possible resolutions of the 
virtual method m at runtime are given by 

$$rt(o.m) = \{C.m \mid C \in rt(o) \wedge m \in C\}.$$ 

We use a class name as a prefix to distinguish different virtual methods. We write $m \in C$ to indicate 
that method m is defined in class C and use $s \in body(C.m)$ to stipulate that statement s appears in 
method C.m. Finally, the call graph of a program P is given by 

$$CG(P) = \{(C.m, C'.m') \mid C \in P \wedge m \in C \wedge o.m' \in body(C.m) \wedge C'.m' \in rt(o.m')\}.$$ 

By $C \in P$, we mean that the class C is defined in the program P. Hence the call graph represents 
the set of all possible pairs of (caller, callee) belonging to the given program. 
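
The definitions of rt and CG above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Endeavour's implementation: the dictionary encodings (`children`, `impl`, `methods`, `calls`, `declared_type`) and the toy class names are hypothetical, and `impl` is assumed to already include the implementers of subinterfaces.

```python
def subtypes(cls, children):
    """Return cls together with all its transitive subclasses (reflexive)."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(children.get(c, ()))
    return seen


def rt(declared, children, impl):
    """Estimated runtime types for a declared class or interface."""
    if declared in impl:  # the declared type is an interface
        result = set()
        for c in impl[declared]:
            result |= subtypes(c, children)
        return result
    return subtypes(declared, children)


def call_graph(methods, calls, declared_type, children, impl):
    """Build the set of (caller, callee) pairs.

    methods: class -> set of method names defined in that class
    calls:   (class, method) -> list of (receiver, method name) call sites
    """
    cg = set()
    for caller, sites in calls.items():
        for receiver, name in sites:
            for c in rt(declared_type[receiver], children, impl):
                if name in methods.get(c, ()):  # m must be defined in C
                    cg.add((caller, (c, name)))
    return cg


# Hypothetical toy program: Circle overrides area, Square only inherits it.
children = {"Shape": ["Circle", "Square"]}
impl = {"Drawable": ["Shape"]}
methods = {"Shape": {"area"}, "Circle": {"area"}, "Square": set(), "Main": {"run"}}
calls = {("Main", "run"): [("s", "area")]}
declared_type = {"s": "Shape"}
toy_cg = call_graph(methods, calls, declared_type, children, impl)
```

In this toy program the call `s.area` resolves to `Shape.area` and `Circle.area` but not to `Square.area`, because `Square` does not define `area` itself.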

Leveraging the approximate call graph, Algorithm 1 implements Explore, Cook’s interprocedural 
algorithm. Explore takes as input the transformed program (Section 3). It initializes a worklist to 
hold all the methods found in the program (Line 1) and associates an empty summary with each 
method (Lines 4-5). The swamp is also initially empty (Line 3). A summary for each method is 
then computed by calling Landfall (Line 8), described below. Using the facts returned by Landfall, 
Line 9 tests whether m belongs in the swamp. It does when its summary contains at least one element of 
the form $(x, \bot)$, meaning that Cook cannot safely, statically determine that it terminates. If this is 
the case, Explore adds m to the swamp (Line 10). Function locals returns the set of l-value representatives 
corresponding to a method’s local variables. Such elements serve no purpose in a summary, so Line 
11 discards them. If m’s summary has changed (Line 12), its entry is updated (Line 13) and its callers are 
placed on the worklist (Lines 14-15). Finally, on Line 16 Explore returns the set of sub-Turing methods 
as the complement of swamp against the set of all program methods. 

    y = 0; 
    if (x > 9) 
        y = z; 

Fig. 8. Code illustrating a case of control (implicit) dependency. Variable y is control-dependent on variable x. 
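
To make the worklist iteration concrete, the following Python sketch mirrors the structure of Explore. It is an illustration under assumptions rather than Cook's code: Landfall is passed in as a callable that also receives the current summaries, the removal of locals is interpreted as dropping every fact that mentions a local, and `toy_landfall` is a hypothetical stand-in for Algorithm 2.

```python
BOT = "⊥"  # stands for the divergence value


def explore(methods, callers, local_vars, landfall):
    """Worklist fix point over method summaries.

    methods:    list of method names, M(P)
    callers:    method -> set of methods that call it (taken from CG(P))
    local_vars: method -> set of that method's local l-values
    landfall:   function (method, summaries) -> set of (x, y) facts
    """
    worklist = list(methods)
    summary = {m: frozenset() for m in methods}
    swamp = set()
    while worklist:
        m = worklist.pop()
        facts = landfall(m, summary)
        if any(y == BOT for (_x, y) in facts):
            swamp.add(m)
        # Assumed reading of "s' - locals(m)": drop facts that mention a local.
        facts = frozenset(
            (x, y) for (x, y) in facts
            if x not in local_vars[m] and y not in local_vars[m]
        )
        if facts != summary[m]:
            summary[m] = facts
            worklist.extend(callers.get(m, ()))  # revisit the callers of m
    return set(methods) - swamp


def toy_landfall(m, summary):
    # Hypothetical stand-in for Algorithm 2: "loop" diverges by itself,
    # "wrapper" returns whatever "loop" returns, "pure" is divergence free.
    if m == "loop":
        return {("ret", BOT)}
    if m == "wrapper":
        return {("ret", y) for (x, y) in summary["loop"] if x == "ret"}
    return {("ret", "a")}


toy_methods = ["pure", "wrapper", "loop"]
toy_callers = {"loop": {"wrapper"}}
toy_locals = {m: set() for m in toy_methods}
islands = explore(toy_methods, toy_callers, toy_locals, toy_landfall)
```

Here `wrapper` ends up in the swamp only because the divergence of `loop` reaches it through the summary, which is the bottom-up direction of propagation described in this section.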


4.2 Landfall: Cook’s Intraprocedural Analysis 


Landfall (Algorithm 2) is an intra-procedural analysis. It approximates the dependence relation 
induced by a given method m over program variables. It uses the lifted set of l-values $L_\bot = L \cup \{\bot\}$ 
and an abstract interpretation over the domain D representing the powerset of pairs of l-value 
representatives: 

$$D = \mathcal{P}(\{(R(x), R(y)) \mid x, y \in L_\bot\}),$$ 

where $R(\bot)$ is defined as $\bot$. Each pair (x, y) in D means that x depends on y, with the use of R 
taking aliasing into account. We call the pair (x, y) a fact. Furthermore, the element $(x, \bot)$ expresses 
that we cannot rule out the possibility that x might be affected by divergence. 

Landfall computes the transitive closure over elements from the domain D with respect to 
statements of method m using two auxiliary functions: control-dependence function control_dep 
and data-dependence function data_dep. The function control_dep captures control dependencies 
created by conditional statements. For example, consider the code shown in Figure 8. If in this 
example we only account for data dependencies, we conclude that variable y only depends on z, 
mistakenly omitting x. However, if x is affected by divergence, we need to propagate this fact to y. 

Before describing how we compute control_dep, we introduce relevant terminology. Each method 
in the program is represented by a Control Flow Graph (CFG), a directed graph (N, E) where N 
is the set of nodes and E a set of edges. Each node represents either an assignment or a branch 
condition. The edges, $E \subseteq N \times N$, represent control flow between program statements. In Carib, 
we map each assignment to a node with one successor and each conditional statement to a node 
with two successors, representing the true and false branches. For CFG node n, succ(n) is the set of 
successors of n, pred(n) its predecessors, and stmt(n) the statement n represents. Finally, each CFG 
includes two special nodes: entry(CFG) is the CFG’s unique entry node, which has no predecessors, 
and exit(CFG) is its unique exit node, which has no successors. 

To compute control dependencies, we use the well-established approach of Ferrante et al. [24], 
which we denote as $control\_dep(CFG, \ell, M)$ where CFG is a control flow graph, $\ell$ a location, and 
M a map associating locations with sets of facts. This function returns the set of facts induced by 
control dependencies for location $\ell$. Function control_dep includes transitive control dependencies. 

Turning to data dependences, the function $data\_dep : D \times L(G) \to D$ models the effect of 
program statements on elements of the abstract domain D. For a given set of facts $d \in D$ and statement 
$s \in L(G)$, data_dep is defined as follows: 

$$data\_dep(d, s) = \{(x, y) \mid \exists z.\ (x, z) \in gen(s) \wedge (z, y) \in d\} \cup (d - kill(s, d)).$$ 
Statement s           gen(s)                         kill(s, d) 
id := c               ∅                              {(x, y) ∈ d | x = id} 
id₁ := id₂            {(id₁, id₂)}                   {(x, y) ∈ d | x = id₁} 
id₁ := op id₂         {(id₁, id₂)}                   {(x, y) ∈ d | x = id₁} 
id₁ := id₂ op id₃     {(id₁, id₂), (id₁, id₃)}       {(x, y) ∈ d | x = id₁} 
id₁ := id₂[id₃]       {(id₁, R(id₂[id₃]))}           {(x, y) ∈ d | x = id₁} 
id₁[id₂] := id₃       {(R(id₁[id₂]), id₃)}           ∅ 
return id             {(ret, id)}                    ∅ 
r := m(Y)             summary(m)[Y/X, r/ret]         d − {(x, y) ∈ d | x = r} 
id := ⊥               {(id, ⊥)}                      {(x, y) ∈ d | x = id} 

Table 1. Definition of gen and kill for relevant Carib statements; for call statements, X contains m’s formals. 


where gen(s) is the set of dependencies locally induced by statement s. For example, gen(x := y + z) 
yields {(x, y), (x, z)}. Function data_dep transitively combines the relation represented by the input 
facts with the relation induced by s. It also excludes (kills) facts that are no longer valid after the 
assignment. For example, 

$$data\_dep(\{(x, t), (y, p)\},\ x := y) = \{(x, p), (y, p)\}.$$ 


Since the assignment modifies x, the fact (x, t) no longer holds. Landfall transitively obtains the 
fact (x, p) from the input fact (y, p) combined with (x, y) from the assignment statement. 
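
The worked example can be reproduced with a few lines of Python. This is a sketch restricted to assignments to simple variables; the function names and the encoding of a statement as a left-hand side plus a list of right-hand-side variables are assumptions made for illustration.

```python
def gen_assign(lhs, rhs_vars):
    """gen for `lhs := <expression over rhs_vars>`; a constant has no rhs_vars."""
    return {(lhs, v) for v in rhs_vars}


def kill_assign(lhs, d):
    """kill for an assignment to a simple variable: every fact about lhs."""
    return {(x, y) for (x, y) in d if x == lhs}


def data_dep(d, lhs, rhs_vars):
    """Compose gen(s) with the input facts d, then add the surviving facts."""
    gen = gen_assign(lhs, rhs_vars)
    composed = {(x, y) for (x, z) in gen for (z2, y) in d if z == z2}
    return composed | (d - kill_assign(lhs, d))


# The example from the text: x := y applied to {(x, t), (y, p)}.
example = data_dep({("x", "t"), ("y", "p")}, "x", ["y"])
```

The result contains (x, p), obtained by composing (x, y) with (y, p), and keeps (y, p); the fact (x, t) is killed.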

We provide the definitions of functions gen and kill for Carib’s basic statements in Table 1. 
Assignments to simple variables (the first five cases) result in dependencies expressing how the 
assignment’s left-hand-side depends on the identifiers appearing in its right-hand-side except 
when the right-hand-side is a constant, which does not introduce any dependencies. When the 
right-hand-side is an object field or an array reference, we use its representative to take aliases 
into account. The return statement return id is modeled as the assignment ret := id, where ret is 
a special variable (see Figure 4) used to store and retrieve the method’s return value. In all these 
cases, we kill input facts expressing dependencies involving the assignment’s left-hand-side. 

In case of an assignment to a field or array element, we use its representative to take aliases into 
account. To preserve soundness, we do not kill any facts. Indeed, a representative over-approximates 
possible aliases. Therefore, the updated l-value may or may not be an actual alias of the l-value in a given fact. 
For a call to a method m, we replace the formal parameters with the corresponding actuals in m’s 
summary, which is a set of facts expressing dependencies induced by m. We also replace the special 
variable ret with r. Landfall computes method summaries iteratively, on-the-fly when demanded by 
Explore. Finally, for the assignment $id := \bot$, we keep the fact expressing that the assigned variable 
is affected by divergence because the purpose of our analysis is to track the propagation of $\bot$. 

Landfall uses control_dep and data_dep in a standard worklist algorithm. The input and output of all nodes 
are initialized to the empty set on Lines 4-5. Then, the entry node’s input is created on Line 6. New facts 
are produced by simulating the effect of program statements using the transfer function data_dep 
(Line 12), accounting for control dependencies (Line 13). When the set of facts associated with a 
given location n changes, all successors of n are explored again (Lines 14-16). The algorithm is 
guaranteed to terminate because $L_\bot$ is finite and so is the set of facts. Once a fix-point is reached, 
the algorithm returns the set of facts accumulated at the exit node. 
Algorithm 2: Landfall approximates the dependency relation over program variables that its 
input method defines; Table 1 defines its gen and kill functions. 
Input: Program P, method m 
Output: set of facts 
 1  Var map IN, OUT; 
 2  Let CFG be the control flow graph of m; 
 3  Let L_⊥ be the lifted set of l-values appearing in P; 
 4  foreach n ∈ node(CFG) do 
 5      IN[n] := OUT[n] := ∅; 
 6  IN[entry(CFG)] := {(R(x), R(x)) | x ∈ L_⊥}; 
 7  worklist := push(entry(CFG)); 
 8  while worklist ≠ ∅ do 
 9      n := pop(worklist); 
10      OUT_old := OUT[n]; 
11      IN[n] := ⋃_{n’ ∈ pred(n)} OUT[n’]; 
12      OUT[n] := data_dep(IN[n], stmt(n)); 
13      OUT[n] := OUT[n] ∪ control_dep(CFG, n, IN); 
14      if OUT_old ≠ OUT[n] then 
15          foreach n’ ∈ succ(n) do 
16              push(worklist, n’); 
17  return OUT[exit(CFG)]; 
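
The data-flow part of Landfall can be sketched in Python as below. The sketch makes several simplifying assumptions that are ours: control dependencies (Line 13) are left out, R is taken to be the identity, the entry node keeps its initial facts because it has no predecessors, and the CFG encoding and node names are hypothetical. It is run on the code of Figure 8.

```python
def transfer(d, stmt):
    """Data-dependence transfer for simple assignments (see Table 1)."""
    if stmt is None:  # entry, exit, or branch condition: no data effect
        return set(d)
    lhs, rhs_vars = stmt
    gen = {(lhs, v) for v in rhs_vars}
    composed = {(x, y) for (x, z) in gen for (z2, y) in d if z == z2}
    return composed | {(x, y) for (x, y) in d if x != lhs}


def landfall(cfg, entry, exit_node, stmts, lvalues):
    """cfg: node -> list of successors; stmts: node -> statement or None."""
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    out = {n: set() for n in cfg}
    init = {(x, x) for x in lvalues}
    worklist = [entry]
    while worklist:
        n = worklist.pop()
        old = out[n]
        if n == entry:
            in_n = set(init)  # assumption: entry keeps the initial facts
        else:
            in_n = set().union(*(out[p] for p in preds[n]))
        out[n] = transfer(in_n, stmts[n])
        if out[n] != old:
            worklist.extend(cfg[n])  # facts changed: revisit successors
    return out[exit_node]


# Figure 8: y = 0; if (x > 9) y = z;
fig8_cfg = {"entry": ["n1"], "n1": ["n2"], "n2": ["n3", "exit"], "n3": ["exit"], "exit": []}
fig8_stmts = {"entry": None, "n1": ("y", []), "n2": None, "n3": ("y", ["z"]), "exit": None}
fig8_facts = landfall(fig8_cfg, "entry", "exit", fig8_stmts, {"x", "y", "z"})
```

With data dependencies alone the result relates y only to z. The missing fact (y, x) is exactly what control_dep contributes in the full algorithm.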
4.3 Implementation 


We implemented our approach for sub-Turing island identification in a tool called Endeavour, 
which is written in Python. Endeavour takes as input an Android application and returns a report 
that includes the analysis result together with other statistics. Endeavour accepts Android apps 
directly in binary (APK) format. It uses Androguard¹ to parse and decompile the APK files as 
well as generate the control flow graphs. Hence, Endeavour does not require source code. We 
use our own intermediate representation for instructions, which has a Lisp-like format. One key 
phase in Endeavour is loop extraction (Section 3.2), which extracts a list of loops, each of which 
is identified by its header together with the nodes it contains. It also obtains the hierarchical 
(domination) relation between loops. Finally, Endeavour implements the over-approximation of 
the call graph based on the class hierarchy approach [59] (Section 4.1). Endeavour is available at 
to.be@posted.post.final.acceptance. 


5 EXPERIMENTAL RESULTS 


This section empirically investigates six research questions involving sub-Turing islands, henceforth 
abbreviated ST-islands. We start by overviewing the application corpora that make up our 
experimental subjects. The investigation then begins by considering the prevalence of ST-islands. 
Simply put, if ST-islands are rare then their study is of little practical value. We next take a deeper 
look into the main causes of divergence. Finding API methods to be the dominant source, we consider 
the impact of safe listing subsets of the API methods. Then, turning to two of the many applications 
of ST-islands, we consider first the relationship between bug density in the swamp and on the 
ST-islands, and second the percentage of verification conditions, such as array bound violations and 
null object dereferences, that occur on ST-islands. Finally, we consider the runtime efficiency of our 
tool Endeavour. 

¹ https://github.com/androguard/androguard 

In the experiments, unless otherwise stated, we make the following assumptions. First, we 
discard getters and setters as we assume that they are implemented in a standard way, making 
them trivially sub-Turing. In addition, we initially assume that all API calls diverge and bind $\bot$ to 
all variables they may write or that depend on them. 

We study two sets of apps: a large dataset, APP_BIN, of over one thousand apps, for which 
source code is unavailable, and a smaller set, APP_SRC, of ten apps, for which full source code is 
available. Both corpora are composed of a range of real-world production apps to ensure that our 
empirical scientific findings have high external validity. The APP_BIN dataset is composed of 1100 
Android applications uniformly selected from more than 600 000 apps collected from Androzoo². 
Androzoo apps have diverse origins, including the Google Play store, which is the predominant 
source of the apps we study. Our set of 1100 apps contains more than 2 million methods. The APP_SRC 
dataset is composed of ten applications selected from GitHub under criteria that we describe 
in Section 5.4. We only consider this dataset in the experiment described in Section 5.4, which requires the 
app source code. In all other experiments, we consider the larger APP_BIN set. 


5.1 Landscape of ST-islands 


First of all it is important to know the proportion of code that resides within ST-islands. The 
answer to this question suggests the code size over which we can reason precisely. A significant 
proportion means that it is worth investing in the improvement of static analysis as the benefit 
may be substantial. So the first research question we address is the following: 


RQ1: What is the proportion of code occupied by ST-islands? 


The results using APP_BIN are summarized in Figure 9 where the left boxplot shows the distribu- 
tion of ST-method percentages. 


Finding 1a: Overall, the average percentage of ST-methods in an app is approximately 55%, 
hence, the majority of methods are sub-Turing. 


To study the impact of code size on our results, we want to exclude trivial methods. Defining 
trivial is hard. We conservatively consider methods of fewer than ten lines as trivial. To convert 
lines into bytecode instructions, we averaged method length in bytecode instructions over its 
non-comment source code length and found that on average each line of source code generates 
three bytecode instructions. Thus, we consider a method trivial if it includes fewer than thirty 
bytecode instructions. The results when considering only non-trivial methods are shown on the 
right of Figure 9. 


Finding 1b: Discounting trivial methods, the percentage of ST-methods is 22%, which, while 
lower than the overall average, still represents a significant portion of the code. 


While the percentage of ST-methods drops, it remains significant as it represents almost a 
quarter of each app. Moreover, our analysis is both sound and efficient; hence the percentage 
underestimates the true proportion of code that lies in non-trivial sub-Turing islands. A more precise 
but less efficient analysis can only ever uncover additional sub-Turing methods. Hence, this result 
underscores the value of investing in static analysis tools specialized to exploit ST-islands. 


² https://androzoo.uni.lu 

Fig. 9. Percentage distribution of ST-methods in our 1100 apps, discarding getters and setters (left boxplot). 
In addition to discarding getters and setters, we also discard methods with fewer than 30 bytecode instructions 
(right boxplot). The average percentage of ST-methods in the first case (left) is 55% and it is 22% for the second 
case (right). 


5.2 Causes of Divergence 


Understanding the causes of divergence informs us about prevalent reasons of precision loss. For 
example, if it turns out that a certain language construct is the dominant cause of divergence, then 
we might want to give it greater attention in future work. Therefore, we seek an answer to the 
following research question: 


RQ2: What are the main causes of divergence? 


To answer this question, we refined our analysis by extending the abstract domain with an 
element indicating the cause of divergence: API call, loop, or recursive method. 


Finding 2: Over corpus APP_BIN, we classify the sources of divergence as follows: 

    API    loop    recursion 
    76%    13%     11% 


We can see that over three quarters of the divergence is due to library API calls. This suggests that 
a more precise modelling of API calls is likely to improve the precision of a given static analysis. 
We set out to experimentally investigate this hypothesis in the next section. 


5.3 API Safe Listing 


Cook is very conservative as it assumes that all API calls cause divergence. In practice, many called 
API methods have well-understood and documented behaviour, making it plausible to 
assume that calls to such API methods are not a source of divergence. In this section, we test the 
impact of this possibility in the following research question: 


RQ3: How does a more precise modelling of APIs impact the analysis? 
Fig. 10. Percentage distribution of ST-methods considering a safe list of most frequently used APIs. The x-axis 
shows the size of the safe list as a percentage of most frequently used APIs. Chart (a) shows box plots for all 
1100 apps, discarding getters and setters, while Chart (b) also discards methods with fewer than 30 bytecode 
instructions. 


We define a safe list of the most frequently used APIs, which are assumed not to induce divergence. 
Among the selected APIs are methods from the Java standard library and some frequently 
used Android API methods. Under this setting, we repeat the experiments of Section 5.1, varying 
the size of the API safe list. Results are shown in Figure 10. Figure 10a shows the percentage of 
ST-methods per app for different sizes of the API safe list, while Figure 10b considers only methods 
with more than 30 bytecode instructions. Results when using an empty API safe list repeat the data 
shown in Figure 9; we include them as a baseline. At the other end, placing all API methods on the 
safe list allows us to investigate the case of a developer who seeks to focus the analysis solely on 
their own code. 
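
One plausible way to build such a safe list is sketched below in Python. The exact ranking procedure is not spelled out in the text, so this is an assumption: distinct APIs are ranked by how often they are called across the corpus and the top fraction is kept. The function names and the toy call data are hypothetical.

```python
from collections import Counter


def build_safe_list(api_calls, fraction):
    """Return the top `fraction` (0.0 to 1.0) of distinct APIs by call count."""
    counts = Counter(api_calls)
    k = round(len(counts) * fraction)
    return {api for api, _count in counts.most_common(k)}


def api_call_diverges(api, safe_list):
    """An API call is a source of divergence unless it is on the safe list."""
    return api not in safe_list


# Hypothetical call data: four distinct APIs with different frequencies.
toy_calls = (["String.length"] * 5 + ["List.add"] * 3
             + ["Socket.read"] + ["Thread.sleep"])
half = build_safe_list(toy_calls, 0.5)
```

An empty list (fraction 0.0) corresponds to the baseline in which every API call diverges, and fraction 1.0 corresponds to placing all API methods on the safe list.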


Finding 3: For a safe list containing just 5% of the most frequently used APIs, the average 
percentage of ST-methods grows to almost 80% when all methods are considered and just over 
50% when only methods containing more than 30 bytecode instructions are considered. 


Here a safe list of only 5% of the frequently used APIs yields an important increase in ST-methods. 
Interestingly, increasing this to 10% has minimal impact, which may be an instance of the way the 
most frequently used calls tend to distribute as a power law. Finally, including all APIs on the safe 
list causes 88% of all methods and 66% of all non-trivial methods to be ST-methods. The trend here 
hints at the value in techniques such as providing formal summaries for the common API methods. 


5.4 Distribution of Bugs over ST-Islands 


It is interesting to check whether there is a correlation between bugs and ST-islands. We address 
this possibility in the following research question: 


RQ4: Is there a significant difference in the bug distribution in the swamp compared to the ST-islands? 
app A            LOC        BD_ST(A)   BD_SW(A) 
bitcoinwallet     23 392       0.5        1.0 
connectbot        26 625       0.0       18.4 
irccloud          57 471       5.7       12.0 
k9               123 606      55.5       59.3 
mgit              10 919      11.7       13.1 
orbot             18 772       0.0        1.1 
owncloud          63 495      48.1       57.2 
signal            92 868      29.2       18.8 
vlc               69 976       8.6       10.7 
worldpress       128 433      24.1       22.7 

Table 2. Bug density in ST-methods and in the swamp for 10 Android open source projects, given as number of 
bugs per thousand lines of code. 


Investigation of this research question requires application source code; thus we make use of the 
APP_SRC collection, which was collected under the following constraints: 

• Open source: we need the code of the application as well as the corresponding repository 
  to perform the experiment. 

• Repository history: to rule out simple weekend projects. 

• Non-trivial size: to rule out small toy applications. 

• Number of application installations: we want the apps to have real users, thereby attesting 
  to their practical use. 

The resulting APP_SRC collection includes the ten real-world applications shown in Table 2. 
We compute bug density for ST-methods and swamp methods using the following steps: 

• To identify bugs and their corresponding locations, we use a heuristic based on a bag of 
  words. We check for the presence of commits associated with keywords such as "bug", 
  "fix", etc. in the git repository of each application. We call such commits bug-fix commits. 
  A buggy line is any line removed, added, or modified by a bug-fix commit. A method is 
  buggy if it contains a buggy line. We assume that a single bug is associated with a single 
  commit and write bugs(m) to express the number of bugs associated with method m. 

• As our analysis is at the bytecode level, we compile the original source code of each app 
  considered to obtain a binary APK file to analyse. 

• Finally, we compute bug density for ST-methods and swamp methods. The bug density for 
  an application A, BD(A), is defined as 

  $$BD(A) = \frac{1}{|A|} \sum_{m \in A} \frac{bugs(m)}{LoC(m)}$$ 

  where LoC(m) is the number of lines of code in method m and |A| the number of methods in 
  A. We respectively denote the bug density for ST-methods and swamp methods as $BD_{ST}(A)$ 
  and $BD_{SW}(A)$. 
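
The steps above can be sketched in Python as follows. This is an illustration under assumptions, not the scripts used in the study: keywords are matched against commit messages as word prefixes, only the two keywords named above are used because the full bag of words is not listed, and bug density is computed as the mean of bugs(m)/LoC(m) over a set of methods, without any scaling to thousands of lines.

```python
import re

# Only the two keywords named in the text; the full list is not given.
BUG_KEYWORDS = ("bug", "fix")


def is_bug_fix_commit(message):
    """Heuristic: some word of the commit message starts with a keyword."""
    words = re.findall(r"[a-z]+", message.lower())
    return any(w.startswith(k) for w in words for k in BUG_KEYWORDS)


def bug_density(methods):
    """methods: list of (bugs, loc) pairs, one pair per method.

    Returns the mean of bugs/LoC over the methods (0.0 for an empty list).
    """
    if not methods:
        return 0.0
    return sum(bugs / loc for bugs, loc in methods) / len(methods)


# Hypothetical data: one buggy 10-line method and one clean 20-line method.
toy_density = bug_density([(1, 10), (0, 20)])
```

Calling `bug_density` once on the ST-methods of an app and once on its swamp methods yields the two quantities that are compared per application.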

Overall in APP_SRC there are 6906 ST-methods comprising 475 KLoC with 1863 bugs, and 
7417 swamp methods comprising 894 KLoC with 5317 bugs. We compare bugginess statistically 
using the non-parametric Wilcoxon test, first at the method level and then at the line level. The 
average bugs per method of 0.27 for ST-methods and 0.72 for the swamp are statistically different 
(p < 0.0001). Because swamp methods tend to include more lines of code, we also compare the two 
using bugs-per-line. In this case, the 0.0265 for ST-methods is again statistically less than the 0.289 
for the swamp (p < 0.0001). Table 2 breaks these bug densities down by program. 

Finally, we use generalized linear models to investigate the question “How likely is a method to be 
buggy?” where a method is considered buggy if it contains one or more bugs. A method’s bugginess 
forms each model’s response variable. Generalized linear models enable us to consider multiple 
explanatory variables as well as binary response variables. In the first model, we use ST-island as 
the sole explanatory variable. With an odds ratio of 2.07, the model predicts that a swamp method is 
over twice as likely to contain a bug when compared to an ST-island method (p < 0.0001). Including 
program as an additional explanatory variable, which enables the model to account for differences 
between programs, increases the odds ratio to 2.09. The impact of additionally including lines as an 
explanatory variable is negligible with or without the program variable. Finally, it is interesting that 
there is no significant interaction between program and a method being an ST-method; thus, the 
likelihood of being an ST-island method is independent of the program. This unexpected uniformity 
strongly supports the external validity of our findings. 


Finding 4: The bug densities for ST-islands are statistically smaller than that of the swamp 
(p < 0.001). 


From the above statistics, bug density tends to be higher in the swamp. This result further supports 
our suggestion to use the swamp as a hint for guiding bug search. In other words, one should 
allocate more of a limited budget (time, resources, etc.) to the swamp than to the ST-islands. 


5.5 Finding Potential Errors 


ST-islands are portions of code about which we can precisely answer whether a given property holds. 
We would like to investigate the presence of concrete properties falling into ST-islands on which 
program safety relies. One such property is the absence of runtime exceptions such as array 
out-of-bounds accesses and null-object dereferences. We address the following research question: 


RQ5: What is the percentage of verification conditions related to detecting bound violations and null 
object dereference runtime errors that occur in ST-islands? 


We studied the spread of these two potential runtime exceptions over ST-islands in our APP_BIN 
corpus of 1100 applications. We count all array accesses and object dereferences in the code and 
compute the proportion of the ones occurring in ST-methods for each application. 

The results, presented in Figure 11, show that just over one in three potential exceptions can be precisely 
checked at compile time because they lie on a sub-Turing island. This is a lower bound for our 
corpus of 1100 apps, because our determination of sub-Turing islands is a safe under-approximation. 
Moreover, as visible in the violin plot (Figure 11), the percentage of array accesses and object 
dereferences that occur in ST-methods is around 80% for a notable number of apps. 


Finding 6: A lower bound on the average percentage of sub-Turing array accesses and object 
dereferences in our corpus is 37%. 


5.6 Analysis Performance 


We have established that non-trivial portions of real-world Android app code lie in sub-Turing 
islands and have demonstrated that this has implications for bug density and verification in an 
empirical analysis. Finally, we report on the computational cost of identifying sub-Turing islands 
using our approximation. While many other techniques for approximation could be used, and 
should be explored in future work, it is useful to know whether at least one such analysis exists that 
is scalable. If we are able to provide evidence that our approximation is computationally feasible 
and, therefore, that there does exist a scalable, useful approximation to sub-Turing islands, this will 
further underscore the practical value of sub-Turing analysis. 

Fig. 11. Percentage distribution of array accesses and object dereferences in ST-methods per application. The 
x-axis depicts a kernel density plot of the data, mirrored around the plot’s central line. Intuitively, kernel 
density captures the likelihood that the y-axis has this value. 


RQ6: Can ST-islands be efficiently identified? 


We measured Endeavour’s analysis time from parsing an application to delivering its output on 
a 3.2GHz Intel Core i5 quad-core processor with 8GB of memory, running Linux. The results show 
that our approach is scalable to real-world applications. 


Finding 7: Endeavour takes less than four minutes for even the largest applications studied, 
containing more than 40 000 methods. 


6 RELATED WORK 


Our analysis marries taint analysis with termination reification (as divergence). Taint analysis is 
a technique used in software security [5, 22, 27, 61, 65]. The goal of taint analysis is to show the 
absence of information leaks from a set of given sources to a set of given sinks. It can be performed 
statically [5, 27, 61, 65] or dynamically [22]. Our bottom-up inter-procedural data-flow analysis 
is a flow-sensitive taint analysis that takes into account implicit information flows due to control 
dependencies. In our case, sources are divergent constructs. Our work also relates to various other 
topics, including invariant generation, loop summarization, bounded model checking, termination 
analysis, strictness analysis and program slicing. 


6.1 Loops 


As loops are a key component in our study, we consider work from the literature aimed at their 
analysis. 
Summaries. Our modelling of potentially non-terminating loops consists of assigning divergence 
values to the variables they possibly modify. Loop summarization techniques make it possible to infer 
loop-free code that soundly approximates a given loop. 

Sharygina and Browne proposed a syntactic transformation for abstracting branches in loops 
in a UML dialect (design level) [55]. Kroening and Weissenbacher proposed an approach based 
on associating recurrence equations with loop variables and then computing a closed form for 
each equation. Kroening et al. [39] proposed a related technique for replacing code fragments, 
including loops, with corresponding abstract transformers that play the role of the summaries. 
Seghir proposed a lightweight technique for inferring loop summaries over array segments as well as 
simple variables using a set of inference rules [53]. Xie et al. presented a technique for summarizing 
loops that contain multiple paths and manipulate strings, with conditions over string content [70]. 
They further extended their work to support disjunctive reasoning [69]. Loop summarization can 
be folded into our approach to increase the number of loops that can be statically determined to 
terminate by construction. 


Invariants. One approach for reasoning about loops in the context of program verification is 
through loop invariants [33]. Many verification tools rely on manually provided invariants [7, 19, 25]. 
However, the literature is rich in terms of approaches that automatically infer invariants in various 
domains: arithmetic [15, 37, 41] (linear), [52] (non-linear), arrays [29, 36, 57] and heaps [47, 51]. 
Software model checkers attempt to build invariants automatically, during the verification process [6, 
12, 13, 32, 34, 46], relying on a popular technique called predicate abstraction [28]. We can use 
invariants to express state changes (transitions) by introducing fresh variables to symbolically 
model initial values of variables. Hence, similar to summaries, we can use them to express the effect 
of a given loop, which should improve our algorithm’s precision. 


Termination. Termination is another issue related to loops. Knowing the after-state of a given 
loop is only possible when the loop terminates. Therefore, we model the effect of potentially non- 
terminating loops by assigning a divergent value to potentially modified variables. The literature is 
rich with work on termination analysis [16, 43-46, 62]. So-called ranking functions [43, 62] 
and transition invariants [16, 44-46] are among the key approaches proposed to show termination. 
They both express relationships over program states modeling the progress of variables. From a 
more pragmatic perspective, showing termination of loops via simple arguments (analysis) has also 
been studied [26]. Integrating loop analysis with our approach would help us mitigate precision 
loss. 


Bounded Model Checking. Bounded model checking (BMC) is a technique that deals with 
loops in a systematic manner by simply unrolling (simulating) them [14, 17, 23]. The unrolling 
process may eventually result in a loop-free code fragment that exactly models the original loop’s 
effect on program variables. Unfortunately, such an approach does not work for loops that are not 
explicitly bounded, as the unfolding process will not terminate. Nonetheless, we can combine BMC 
with our approach to improve our reasoning precision by restricting its application to loops with 
explicit bounds and applying other techniques to those without. 
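
The unrolling step can be sketched as follows. This is a simplification that operates on concrete states, whereas BMC tools unroll symbolically; the function name `unroll` and the tuple encoding of the state are assumptions made for this example.

```python
def unroll(cond, body, state, bound):
    """Simulate `while cond(state): state = body(state)` with at most
    `bound` copies of the loop body.

    Returns (state, complete). `complete` is True only if the loop
    condition is false after unrolling, i.e. the bounded unrolling
    models the loop's effect exactly for this initial state.
    """
    for _ in range(bound):
        if not cond(state):
            break
        state = body(state)
    return state, not cond(state)


# State is (i, s) for the loop `while i < 4: s += i; i += 1`.
cond = lambda st: st[0] < 4
body = lambda st: (st[0] + 1, st[1] + st[0])

exact = unroll(cond, body, (0, 0), bound=4)    # ((4, 6), True)
partial = unroll(cond, body, (0, 0), bound=2)  # ((2, 1), False)
```

A loop without an explicit bound never reports `complete` for any finite `bound`, which is why such loops need a different technique.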


6.2 Slicing 


Program slicing is a technique proposed by Weiser [67] to extract a set of statements, called a slice, 
that influence a specified computation of interest, referred to as the slicing criterion. The semantics 
of the original program are preserved by the slice with respect to the slicing criterion. There has 
been a tremendous amount of work on slicing and its applications [9]. While the original proposal 
statically defined a slice, a dynamic variant has been proposed as well [2]. In the latter, a slice is a 
set of statements that affect the slicing criterion with respect to a particular input. 
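
A static backward slice can be sketched for the simplest case of straight-line code with data dependences only. The encoding of statements as `(target, uses)` pairs and the name `backward_slice` are assumptions for this illustration; control dependences, which a full slicer must also track, are deliberately left out.

```python
def backward_slice(stmts, criterion):
    """Return the indices of the statements that may influence the
    criterion variables at the end of a straight-line program.

    stmts: list of (target, uses) pairs, one per assignment.
    criterion: iterable of variable names of interest.
    """
    relevant = set(criterion)
    kept = []
    for idx in range(len(stmts) - 1, -1, -1):
        target, uses = stmts[idx]
        if target in relevant:
            kept.append(idx)
            relevant.discard(target)  # this definition kills earlier ones
            relevant.update(uses)     # its operands become relevant
    return sorted(kept)


program = [
    ("a", []),          # 0: a = input()
    ("b", []),          # 1: b = input()
    ("s", ["a"]),       # 2: s = a + 1
    ("p", ["b"]),       # 3: p = b * 2
    ("s", ["s", "a"]),  # 4: s = s + a
]

slice_for_s = backward_slice(program, ["s"])  # [0, 2, 4]
```

Statements 1 and 3 are dropped because they cannot affect `s`, so the slice preserves the original behaviour with respect to that criterion only.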

Slicing has been applied to various problems: program debugging [1], testing [31], comprehension [38], 
re-use [11], and re-engineering [49]. While the original proposal is syntax-preserving 
(i.e., the statements of the slice are all taken from the original code), some variants are amorphous [30], 
allowing changes to the program syntax as long as the program semantics are preserved with 
respect to the criterion. In the context of software model checking, path slicing was proposed to 
find statements in a given path that are relevant to showing its (in)feasibility [35]. Slicing has also 
been used to reduce the number of interleavings in event-oriented applications [10], and recently it 
has been combined with runtime analysis to extract values of variables that make an application 
difficult to statically analyse [48]. 

Our approach shares with slicing the characteristic of relying on dependency analysis. Moreover, 
our analysis naturally yields sub-Turing slices (i.e., portions of the program that are sub-Turing). 
We obtain them by simply backtracking paths in the control flow graph of a given method and 
selecting statements that are not affected by divergent values. 


6.3 Strictness Analysis 


Similar to our approach, strictness analysis has been proposed to track divergence resulting from 
non-termination and from errors that cause program crashes, such as division by zero. A function is said to 
be strict if it diverges whenever one of its parameters diverges. A variant of strictness analysis, 
joint-strictness, takes into account parameter combinations. A function is jointly-strict in a subset 
of its arguments if it diverges when all the arguments of the subset diverge. Mycroft proposed 
an approach to approximate the divergence relationship induced by a given function over its 
parameters and the result it returns [42]. The approach relies on an underlying forward abstract 
interpretation [18]. A backward analysis has been implemented into the Glasgow Haskell Compiler 
to perform strictness analysis in a demand-driven fashion [54]. Other forms of strictness analysis 
have been proposed in the literature. For example, Wadler and Hughes describe several forms of 
projection-based strictness [64], such as head-strictness and tail-strictness, refining the original basic definition. 

However, a function being sub-Turing does not entail that it is strict, nor does strictness entail 
being sub-Turing. Indeed, if a function always diverges regardless of its parameters, it is strict but 
not sub-Turing. On the other hand, the function f(x,y){if x return 1 else return y} is sub-Turing but 
not strict. It is sub-Turing as it does not contain any divergent construct. However, it is not strict 
because when x is true the function does not diverge even if y diverges. 
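
The behaviour of f can be reproduced with a small sketch in which parameters are passed as thunks and divergence is modelled by an exception. This encoding is an assumption made for illustration (the names `Diverge`, `bottom`, and `diverges` are hypothetical); an exception only stands in for non-termination.

```python
class Diverge(Exception):
    """Stands in for a diverging computation (non-termination or a crash)."""


def bottom():
    """A parameter whose evaluation diverges."""
    raise Diverge()


def f(x, y):
    """The function from the text; x and y are zero-argument thunks
    that are evaluated only when their value is needed."""
    if x():
        return 1
    return y()


def diverges(thunk):
    """Report whether evaluating the thunk diverges (in this model)."""
    try:
        thunk()
    except Diverge:
        return True
    return False


# y diverges but is never evaluated, so f still returns 1.
not_strict_in_y = f(lambda: True, bottom)
```

By contrast, `f(bottom, lambda: 5)` diverges, because x is always evaluated. Under the definition above, the single counterexample for y is enough to make f non-strict.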


6.4 Pointer Analysis 


Pointer analysis aims at determining the set of memory locations a pointer may refer to during 
program execution. Two popular pointer analyses that constitute the basis of many other approaches 
are Steensgaard’s [58] and Andersen’s [3]. While Steensgaard’s analysis does not take into account 
the direction of flow of values induced by assignments, Andersen’s approach models assignment 
direction. Therefore, Steensgaard’s technique offers more scalability while Andersen’s provides 
more precision. Das proposed an algorithm lying between Andersen’s and Steensgaard’s approaches 
[21]. It is scalable and, at the same time, its precision is very close to Andersen’s. 
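
The difference in precision can be seen in a deliberately reduced sketch that supports only two statement forms, `p = &x` and `p = q`, and omits loads and stores. The function names and the statement encoding are assumptions for this example and do not reproduce either original algorithm in full.

```python
from itertools import count


def andersen(addr_of, copies):
    """Inclusion-based: `p = q` adds pts(q) to pts(p), one direction only."""
    names = {v for pair in addr_of + copies for v in pair}
    pts = {v: set() for v in names}
    for p, x in addr_of:
        pts[p].add(x)
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for p, q in copies:
            if not pts[q] <= pts[p]:
                pts[p] |= pts[q]
                changed = True
    return pts


def steensgaard(addr_of, copies):
    """Unification-based: `p = q` merges the pointee classes of p and q."""
    parent, pointee, fresh = {}, {}, count()

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def target(v):
        r = find(v)
        if r not in pointee:
            pointee[r] = ("fresh", next(fresh))  # placeholder pointee class
        return pointee[r]

    def unify(a, b):
        a, b = find(a), find(b)
        if a == b:
            return
        parent[a] = b
        pa, pb = pointee.pop(a, None), pointee.get(b)
        if pa is not None and pb is not None:
            unify(pa, pb)  # merged classes must share one pointee class
        elif pa is not None:
            pointee[b] = pa

    for p, x in addr_of:
        unify(target(p), x)
    for p, q in copies:
        unify(target(p), target(q))

    names = {v for pair in addr_of + copies for v in pair}
    result = {}
    for v in names:
        r = find(v)
        cls = find(pointee[r]) if r in pointee else None
        result[v] = {n for n in names if cls is not None and find(n) == cls}
    return result


# p = &x; q = &y; r = p; r = q
addr_of = [("p", "x"), ("q", "y")]
copies = [("r", "p"), ("r", "q")]
a = andersen(addr_of, copies)
s = steensgaard(addr_of, copies)
```

On this input the inclusion-based variant keeps `p` pointing only to `x`, while the unification-based variant merges `x` and `y` into one class, so `p`, `q`, and `r` all appear to point to both.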

Lhoták and Hendren introduced the SPARK framework [40], which offers building blocks for 
implementing various pointer analyses for Java. 

Sridharan et al. proposed a pointer analysis variant which is suitable for environments with 
small time and memory budgets [56]. Their approach is demand-driven, i.e., performs only the 
work necessary to answer a query issued by a client. 
Instead of applying a pointer analysis, we soundly handle aliases using the variable representative 
idea inspired by Sundaresan et al. [59] (Section 3.1). We plan to empirically study the impact of pointer 
analysis on Cook. 


7 CONCLUSION 


In this paper, we addressed the empirical question of how often a program analysis question has, in 
practice, an exact solution. To this end, we introduced sub-Turing islands, which are portions of code 
in which any question of interest is decidable. We provided a formal definition of sub-Turing islands 
and presented an algorithm for identifying such islands in applications. We have implemented 
our approach in a tool called Endeavour and applied it to a representative corpus of 1100 Android 
applications. 

Our empirical study revealed that sub-Turing islands make up 55% of the methods in the 1100 
Android apps studied. These results are not merely of theoretical interest, but have practical 
ramifications in software engineering. Our findings suggest that we can provide more precise 
assessments of test coverage; that we can expect more precise change impact analysis; 
that we can hope for more precise slices and, thereby, more precise re-use, better comprehension, 
and better re-engineering interventions. For example, in the code on which we report, 37% of 
runtime-exception guards reside within sub-Turing islands. This means that an exact answer 
regarding the validity of these guards can be statically determined. 